Abstract - Several characteristics of written texts have been inferred from statistical analysis
derived from networked models. Even though many network measurements have been adapted
to study textual properties at several levels of complexity, some textual aspects have been
disregarded. In this paper, we study the symmetry of word adjacency networks, a well-known
representation of text as a graph. A statistical analysis of the symmetry distribution performed
on several novels showed that most of the words do not display symmetric patterns of connectivity.
More specifically, the merged symmetry displayed a distribution similar to the ubiquitous
power-law distribution. Our experiments also revealed that the studied metrics do not correlate
with other traditional network measurements, such as the degree or betweenness centrality. The
effectiveness of the symmetry measurements was verified in the authorship attribution task.
Interestingly, we found that specific authors prefer particular types of symmetric motifs. As a
consequence, the authorship of books could be accurately identified in 82.5% of the cases, in a
dataset comprising books written by 8 authors. Because the proposed measurements for text
analysis are complementary to the traditional approach, they can be used to improve the
characterization of text networks, which might be useful for applications such as identification
of topical words and information retrieval.


Introduction. — In recent years, network science has 
become commonplace. Many real systems such as the Internet, social networks and
transportation systems have increasingly been studied via networked models [^]. Because
language is organized by rules and relationships between words in a complex way, it can
also be represented as a network. In this case, words are connected according to syntactical
or semantical relationships [^]. The use of the network framework not only allowed for a
better understanding of the origins and organization of language, but also improved the
performance of several natural language processing tasks, including e.g. the automatic
summarization of texts, the identification of word senses and the classification of
syntactical complexity [^].

Many measurements proposed for analyzing complex networks have been reinterpreted when
applied to analyze linguistic features. Centrality measurements, for example, have been
useful to identify core concepts and keywords, which in turn have allowed the improvement
of summarization and classification tasks [^]. While a myriad of measurements have been
adapted to probe textual patterns, only a limited number of studies have been devoted to
devising novel network measurements that are able to identify more complex linguistic
patterns. Particularly, a relevant pattern that has not been addressed by networked-linguistic
models is the quantification of the heterogeneity of specific textual distributions. This is
the case of the spatial distribution of words along the text, which has been mainly studied
in terms of the burstiness (or intermittency) of time series [^]. Another interesting pattern
concerns the uneven distribution of the number of distinct neighbors of words [^]. In this
context, we introduce two network measurements to quantify the heterogeneity of accessing
words’ neighbors in word adjacency networks. As we shall show, the adopted measurements,
henceforth referred to as symmetry measurements, are able to characterize authors’ stylistic
marks, since distinct authors display specific biases towards particular network motifs. In
addition to being useful to improve the characterization of word adjacency networks, the
symmetry measurements do not correlate with other traditional network measurements.
Therefore, they could be useful to complement the characterization of text networks at
their several levels of complexity.


Methods. — In this section, we describe the formation 
of word adjacency networks from raw books. The symmetry measurements, namely backbone
and merged symmetry, are then described. Furthermore, we present a short introduction to
the pattern recognition methods employed in this study.

Word adjacency networks. Written texts can be modeled as networks in several ways. If one
aims at grasping stylistic textual features, networks generated from syntactical analysis
can be employed [^]. Another possibility is to map texts into a word adjacency network
(WAN), which links adjacent words [9,11,12]. Actually, WANs can be considered as an
extension of the syntactical model since most of the syntactical links occur between
adjacent words [^]. Because syntax depends upon the language, WANs have also proven useful
to capture language-dependent features [13].

To construct a word adjacency network, some pre-processing steps are usually applied.
First, stopwords such as articles and prepositions are removed because such words convey
no semantic information. Since stopwords usually play the role of linking content words,
they can be modeled as edges in the WAN model. In order to represent as a single node the
words that refer to the same concept, the text undergoes a lemmatization process. Hence,
nouns and verbs are mapped to their singular and infinitive forms, respectively. To
minimize the errors arising from the lemmatization, before this step, all words are labeled
with their part-of-speech [14]. In the current study, a maximum-entropy model was used to
perform the part-of-speech labeling. After the lemmatization and the removal of the
stopwords, each distinct word is mapped into a node and edges are created between adjacent
words. Further details regarding the WAN model can be found in [^].
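The construction steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the stopword list and the lemma map are tiny hypothetical stand-ins for the part-of-speech tagger and lemmatizer used in the study, the network is taken to be undirected and unweighted, and adjacency is allowed to cross sentence boundaries. None of these choices is confirmed by the text.

```python
import re
from collections import defaultdict

# Hypothetical, tiny resources used only for illustration; the study relies on
# a part-of-speech tagger and a real lemmatizer instead.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "was"}
LEMMAS = {"cats": "cat", "sat": "sit", "mats": "mat", "chased": "chase"}

def build_wan(text):
    """Return a word adjacency network as {word: set of neighboring words}."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Map every token to its lemma, then drop the stopwords.
    lemmas = (LEMMAS.get(token, token) for token in tokens)
    content = [word for word in lemmas if word not in STOPWORDS]
    wan = defaultdict(set)
    # Link each pair of words that became adjacent after the pre-processing.
    for left, right in zip(content, content[1:]):
        if left != right:  # ignore self-loops caused by repeated words
            wan[left].add(right)
            wan[right].add(left)
    return dict(wan)

wan = build_wan("The cats sat on the mat. A dog chased the cats.")
print(sorted(wan["cat"]))
```

Because the stopwords are removed before linking, "sit" and "mat" become neighbors even though "on the" separated them in the raw sentence, which is the sense in which stopwords act as edges.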

Symmetry in networks. Symmetry is one of the most fundamental aspects of complex systems,
naturally emerging from physical spatial restrictions and laws [16], self-organization [17],
biological structures [18] and chemical reactions [19], etc. Written texts are no exception
to this rule, presenting intrinsic patterns of symmetry. In a sentence, for instance, some
words can be exchanged by synonyms without compromising its original meaning. In a similar
fashion, some grammatical constructions are also interchangeable. Aside from restrictions
conveying semantic relationships and grammatical rules, authors also tend to employ
additional restrictions in their works, which in turn affect their writing style. Whenever
texts are represented by networks, it is expected that such styles may be reflected in the
symmetrical characteristics of their topological structure.

While the concept of symmetry in graph theory is tightly related to the problem of finding
and counting automorphisms, this approach cannot be straightforwardly extended to study
most real complex networks [20]. Recently, practical definitions of symmetries for real
networks have been proposed in the literature. They include path similarity techniques [20],
methods based on quantum walks [21] and concentric rings [22]. The latter presents some
advantages over the other strategies. For example, the symmetry can be calculated locally
around nodes in a multiscale fashion, defined in terms of node-centered subgraphs referred
to as concentric patterns, which are conceptually linked to the concentric levels of a node.
The concentric level $\Gamma_h(i)$ is defined as the set of nodes $h$ hops away from the
original node $i$, and the concentric $l$-pattern is the subgraph comprising only nodes
located $l$ or fewer hops away from $i$, i.e., nodes in the set $\bigcup_{h=0}^{l} \Gamma_h(i)$.
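The definitions of concentric level and concentric pattern translate directly into a breadth-first search. The sketch below assumes an undirected network stored as a dictionary of neighbor sets; the function names `concentric_levels` and `concentric_pattern` are illustrative and do not come from the paper.

```python
from collections import deque

def concentric_levels(adj, source, h_max):
    """Return [level_0, level_1, ...]; level_h holds the nodes h hops from source."""
    dist = {source: 0}
    levels = [{source}]
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == h_max:
            continue  # do not expand beyond the requested level
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                if dist[v] == len(levels):
                    levels.append(set())
                levels[dist[v]].add(v)
                queue.append(v)
    return levels

def concentric_pattern(adj, source, l):
    """Node set of the concentric l-pattern: the union of levels 0..l."""
    return set().union(*concentric_levels(adj, source, l))

# Small hypothetical network: a-b, a-c, b-c, b-d, d-e.
adj = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"},
       "d": {"b", "e"}, "e": {"d"}}
levels = concentric_levels(adj, "a", 2)
pattern = concentric_pattern(adj, "a", 2)
```

In this toy network the edge b-c joins two nodes of the same concentric level, which is exactly the kind of connection that the backbone and merged transformations treat differently.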

The concentric symmetry approach is based on the accessibility measurement [23], which is
calculated as a normalization of the entropy obtained from the transition probabilities for
a network walk dynamics, such as the traditional random walk or self-avoiding random walk.
In particular, the symmetry is obtained considering a very special case of walk dynamics in
which an agent never goes back to a node belonging to a lower concentric level. However, to
account for the degeneracy caused by connections between nodes in the same concentric level,
two transformations of concentric patterns were proposed, resulting in two types of symmetry
measurements: backbone and merged symmetries. The backbone symmetry, $S_b$, is loosely based
on the concept of radial symmetry, in which edges among nodes in the same concentric level
are removed from the pattern. In contrast, the merged symmetry, $S_m$, which bears some
resemblance to angular symmetry, is obtained from patterns by effectively merging nodes in
the same concentric level. In both cases, the symmetry measurements $S_h$ for level $h$
centered on $i$ are calculated from the Shannon entropy of the transition probabilities
$P_h(i \to j)$. More specifically,

$$ S_h(i) = \frac{\exp\left( -\sum_{j \in \Gamma_h(i)} P_h(i \to j) \ln P_h(i \to j) \right)}{|\Gamma_h(i)| + \sum_{r=0}^{h-1} \xi_r}, \qquad (1) $$

where $\xi_r$ stands for the number of dead ends in level $r$ (i.e., nodes with no
connections to any node in the next concentric level). Fig. 1 illustrates the backbone and
merged transformations for two patterns alongside the transition probabilities and
calculated symmetries.

Fig. 1: Example illustrating the calculation of the backbone and merged symmetries for two concentric 2-patterns. The numbers
next to each node account for the transition probabilities and colors indicate the respective concentric level of a node (blue
for level 0, orange for level 1 and green for level 2) [^]. Red self loops indicate a dead end. Both transformations of patterns
are shown. The backbone pattern is obtained by removing edges connecting nodes at the same level from the original pattern,
whereas merged patterns are weighted subgraphs created by merging nodes originally connected at the same concentric level.
In this case, the weight corresponds to the number of connections spanning from the nodes that were merged to each node in
other concentric levels. Note that the pattern in the left panel only presents merged asymmetry, while the pattern in the right
panel presents both types of asymmetry, which is also confirmed by the symmetry values.
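A possible implementation of the backbone symmetry of equation (1) is sketched below. It rests on assumptions that the text does not spell out: the walker chooses uniformly among the neighbors in the next concentric level, and a dead end is treated as a self loop that retains its probability until step h, so dead ends contribute to the entropy as well as to the normalization. The merged symmetry, which additionally requires merging connected nodes of the same level, is not covered. The function names are illustrative.

```python
import math
from collections import deque

def bfs_distances(adj, source, h):
    """Hop distance from source for every node located at most h hops away."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] < h:
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    return dist

def backbone_symmetry(adj, source, h):
    """Backbone symmetry of `source` at level h (one reading of equation (1))."""
    dist = bfs_distances(adj, source, h)
    prob = {source: 1.0}  # probability of finding the walker at each node
    dead_ends = 0         # dead ends found in levels 0 .. h-1
    for r in range(h):
        nxt = {}
        for u, p in prob.items():
            at_front = dist[u] == r
            # Backbone pattern: only edges towards the next level are kept.
            forward = ([v for v in adj.get(u, ()) if dist.get(v) == r + 1]
                       if at_front else [])
            if forward:
                for v in forward:
                    nxt[v] = nxt.get(v, 0.0) + p / len(forward)
            else:
                if at_front:
                    dead_ends += 1            # newly found dead end in level r
                nxt[u] = nxt.get(u, 0.0) + p  # self loop: the walker stays put
        prob = nxt
    entropy = -sum(p * math.log(p) for p in prob.values())
    reachable = sum(1 for d in dist.values() if d == h) + dead_ends
    return math.exp(entropy) / reachable

# Hypothetical tree: node 1 has two children, node 2 has one.
tree = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5], 3: [1], 4: [1], 5: [2]}
s_tree = backbone_symmetry(tree, 0, 2)

# Hypothetical pattern in which node 2 is a dead end in level 1.
with_dead_end = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
s_dead = backbone_symmetry(with_dead_end, 0, 2)
```

In the tree, the three nodes of the second level are reached with probabilities 1/4, 1/4 and 1/2, so the symmetry falls below 1. In the second pattern the walker ends with equal probability at node 3 and at the dead end, which this reading counts as a perfectly symmetric outcome.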


Pattern Recognition Methods. Pattern recognition methods are useful to identify patterns
and infer classifiers [24]. Particularly, in this study, pattern recognition methods were
applied to recognize patterns in the distribution of symmetry measurements across distinct
authors. Four pattern recognition methods were employed: support vector machines (SVM),
multilayer perceptron (MLP), nearest neighbors (KNN) and naive Bayes (NBY). These four
methods were chosen because they usually display a good overall performance [25]. An
introduction to these methods can be found in [25,26]. We also provide a very short
introduction to these methods in the Supplementary Information.
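As a rough illustration of how one of these methods operates on symmetry values, the sketch below implements a one-nearest-neighbor classifier with Euclidean distance and evaluates it by leave-one-out. The feature vectors and author labels are hypothetical, and the leave-one-out protocol is an assumption: the text does not state which validation scheme or which parameters were used.

```python
import math

def nearest_neighbor_predict(train, query):
    """Return the label of the training vector closest to `query`."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]

def leave_one_out_accuracy(samples):
    """Classify each sample using all the others and report the hit rate."""
    hits = 0
    for index, (features, label) in enumerate(samples):
        rest = samples[:index] + samples[index + 1:]
        hits += nearest_neighbor_predict(rest, features) == label
    return hits / len(samples)

# Hypothetical books, each described by the symmetry of two words.
books = [
    ((0.10, 0.80), "author A"),
    ((0.12, 0.78), "author A"),
    ((0.60, 0.20), "author B"),
    ((0.58, 0.25), "author B"),
]
accuracy = leave_one_out_accuracy(books)
```

In the actual experiments each book would be described by one symmetry value per selected word rather than by two values.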


Results and discussion. — This section is divided 
into two subsections. First, we study the statistical properties of symmetry measurements
in word adjacency networks. We then show how the symmetry of specific words can be employed
to discriminate authors’ styles. The list of books employed in the experiments is shown in
Table S1 of the Supplementary Information.

The Supplementary Information is available from https://dl.dropboxusercontent.com/u/2740286/symmetry.pdf

Properties of merged and backbone symmetry in word adjacency networks. We start the
investigation of the statistical properties of symmetry measurements in textual networks by
analyzing the distribution of symmetry values in real networks formed from books. Here we
focus our discussion on the book “Adventures of Sally”, by P.G. Wodehouse. Nevertheless,
the discussion below also applies to the other books of the dataset. Concerning the merged
symmetry, all books displayed a probability density function with the following logistic
form:


$$ P(S_m) = \frac{A_1 - A_2}{1 + (S_m/S_0)^p} + A_2, \qquad (2) $$

where $A_1$, $A_2$, $S_0$ and $p$ are constants. According to equation (2), high values of
symmetry are very rare. This is similar to other well-known distributions in texts, such as
the frequency distribution given by Zipf’s law [27]. Fig. 2 illustrates the histogram of
the symmetry distribution obtained for the book “Adventures of Sally”, by P.G. Wodehouse.
For this book in particular, the p.d.f. of the merged symmetry in equation (2) can be
written as

$$ P(S_m) = \frac{A}{1 + (S_m/S_0)^p}, \qquad (3) $$

where $A = 1.0136$, $S_0 = 0.0136$ and $p = 1.25348$. The high value of the adjusted $R^2$
($R^2 = 0.99181$) and the low value of chi-square ($\chi^2$ = 1.43261 · 10“^) confirm the
adherence of the fit.
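To see why this logistic form resembles a power law, it helps to evaluate equation (3) with the fitted constants. The short sketch below does so; the function name is illustrative, and the two probe values 0.1 and 1.0 are chosen here only because they lie well above $S_0$.

```python
def merged_symmetry_pdf(s_m, a=1.0136, s0=0.0136, p=1.25348):
    """Equation (3): P(S_m) = A / (1 + (S_m / S_0)^p), with the fitted constants."""
    return a / (1.0 + (s_m / s0) ** p)

# At S_m = S_0 the density is exactly half of A.
half = merged_symmetry_pdf(0.0136)

# Well above S_0 the constant 1 in the denominator becomes negligible, so
# multiplying S_m by 10 divides the density by a factor approaching 10^p.
tail_ratio = merged_symmetry_pdf(0.1) / merged_symmetry_pdf(1.0)
power_law_ratio = 10 ** 1.25348
```

For $S_m \gg S_0$ the expression reduces to $A (S_m/S_0)^{-p}$, so the tail decays approximately as a power law with exponent $p$, while the constant term keeps the density finite as $S_m$ approaches zero.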

Unlike the merged symmetry, the backbone counterpart displayed a distribution of values
with two typical peaks, as revealed by Fig. 2 (see left panel). The first peak of the
distribution occurs around $S_b \approx 0.3$. While low values of backbone symmetry are
very rare, high values are frequent, especially for the less frequent words. This occurs
because smaller (or sparsely connected) concentric patterns are less likely to accumulate
enough imperfections over the concentric levels to attain very low symmetry values. On the
other hand, larger patterns do not present such constraints and can attain many distinct
levels of symmetry.

While several traditional centrality network measurements correlate with the node degree,
the proposed symmetry measurements for text analysis usually do not yield a strong
correlation with the connectivity of nodes. In Tables 1 and 2 we show, in the same row,
words with similar degree taking very discrepant values of merged and backbone symmetries.
For example, in Table 2 the words bathing and mother occur with the same frequency; however,
the respective values of backbone symmetry are quite discrepant. As a matter of fact, the
access to the second-level neighbors is much more regular for the word mother, as its
backbone symmetry is close to the maximum possible value, i.e. $\max(S_b) = 1$.

Fig. 2: Histograms of the distribution of the backbone $S_b$ and merged symmetries $S_m$
(computed at the second level) for the book “Adventures of Sally”. The merged symmetry
computed at the second level seems to follow a logistic function (see equations (2) and
(3)). A similar distribution was found for the other books of the dataset.


Table 1: Merged symmetry (second level) computed for se¬ 
lected words in the book “Adventures of Sally”, a novel by 
P.G. Wodehouse. Note that words with similar degree k (the 
words in the same line) may take distinct values of symmetry. 


Word         Sm      k     Word           Sm      k
Cracknell    0.011   31    hotel          0.024   33
heart        0.012   27    corner         0.029   27
gentleman    0.012   26    conversation   0.041   24
revue        0.012   21    rise           0.062   21
notice       0.013   17    blow           0.080   17
cold         0.014   11    tongue         0.094   10
luck         0.017    6    wealth         0.331    6
meditate     0.020    5    banquet        0.259    5


The correlation between symmetry and other traditional topological measurements was also
investigated. According to Fig. 3, there is no consistent, significant correlation between
symmetry and other network measurements. This means that the values of both merged and
backbone symmetry cannot be mimicked by other well-known network measurements. Therefore,
the symmetry measurements provide novel information for network analysis.
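For reference, the Pearson coefficient used in this kind of comparison can be computed as in the sketch below. The per-node values are hypothetical and serve only to show a case in which a measurement carries no linear information about the degree; they are not taken from the books.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical per-node values: one measurement that grows with the degree
# and one that is unrelated to it.
degree = [1, 2, 3, 4]
tracks_degree = [2, 4, 6, 8]
unrelated = [1, -1, -1, 1]

r_tracking = pearson(degree, tracks_degree)
r_unrelated = pearson(degree, unrelated)
```

A coefficient close to zero, as for the second sequence, is the situation reported for the symmetry measurements, whereas many centrality measurements behave more like the first sequence.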

Table 2: Backbone symmetry (second level) computed for selected words in the book
“Adventures of Sally”, a novel by P.G. Wodehouse. Note that words with similar degree k
(the words in the same line) may take distinct values of symmetry.

Word         Sb      k     Word           Sb      k
hair         0.196   30    hotel          0.472   33
heart        0.211   27    corner         0.412   27
manner       0.190   26    conversation   0.544   24
chapter      0.127   22    york           0.579   19
water        0.132   16    disappear      0.610   14
note         0.052   10    mysterious     0.709    8
memory       0.089    6    secure         0.904    6
bathing      0.071    5    mother         0.932    5

Fig. 3: Pearson correlation coefficients between symmetry and other traditional network
measurements. Note that, in general, there is a weak correlation between symmetry and other
measurements. The correlations were obtained from the word adjacency network obtained from
the book “Adventures of Sally”, by P.G. Wodehouse.

Authorship recognition via network symmetry. In this section, we exemplify the
discriminative power of symmetry measurements in word adjacency networks. More
specifically, we show that the symmetry of specific words is able to identify the writing
style of distinct authors. In the context of information sciences, the authorship
recognition task is relevant because it can be useful to classify literary manuscripts [28]
and intercept terrorist messages [29]. Traditional features employed for stylometric
analysis include simple statistics such as the average length and frequency of words [30],
vocabulary richness [30] and burstiness indexes.

To evaluate the ability of the symmetry measurements to recognize particular authors’
styles, we used a dataset of 40 books written by 8 authors (see Table S1 of the
Supplementary Information). As features for the classification task, both merged and
backbone symmetry were computed for the 229 words appearing in all books of the dataset. To
automatically recognize and classify the patterns displayed by each author, we used the
four pattern recognition techniques described in the methodology. The accuracy rates in
identifying the correct author are shown in Table 3. With regard to the performance of the
pattern recognition methods, the best results were obtained with the SVM and MLP methods.
When the symmetry was computed considering the second level of neighbors (h = 2), the best
accuracy rate achieved was 75.0% (this corresponds to a p-value lower than 1.0 · 10“^^).
Neither symmetry measurement calculated at the third level improved on the best
classification performance obtained with h = 2. A minor improvement in performance occurred
when the fourth level was included in the analysis: the best accuracy rate increased from
75.0% to 82.5%. We also probed the performance of the classification by combining different
levels as features. In this case, the performance did not improve (result not shown). All
in all, these results confirm the suitability of symmetry measurements to identify the
subtleties of authors’ styles in terms of the homogeneity of accessibility of neighbors.
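One way to gauge how unlikely such accuracy rates are under pure chance is a binomial tail probability. The sketch below assumes that a random classifier assigns each of the 40 books to one of the 8 authors with probability 1/8; this baseline is an assumption made here for illustration, since the text does not state how the reported p-value was obtained.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, n + 1))

# 75.0% of 40 books corresponds to 30 correct attributions, 82.5% to 33.
p_value_h2 = binomial_tail(40, 30, 1 / 8)
p_value_h4 = binomial_tail(40, 33, 1 / 8)
print(p_value_h2, p_value_h4)
```

Under this baseline both probabilities are extremely small, which is consistent with the claim that the observed accuracy rates cannot be attributed to random guessing.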


Table 3: Accuracy rate found for the authorship recognition 
task. The best accuracy rate found to recognize the authorship 
in a dataset comprising 8 authors was 82.5%. 


Symmetry           SVM      MLP      KNN      NBY
Merged, h = 2      75.0%    72.5%    55.0%    42.5%
Merged, h = 3      70.0%    62.5%    65.0%    40.0%
Merged, h = 4      82.5%    82.5%    57.5%    42.5%
Backbone, h = 2    32.5%    32.5%    20.0%    20.0%
Backbone, h = 3    70.0%    72.5%    57.5%    27.5%
Backbone, h = 4    70.0%    82.5%    57.5%    42.5%


To understand the patterns behind the high discriminability rates found in Table 3, we show
some visualizations obtained for two words, “time” and “indeed”, in Fig. 4. We chose these
words because they were able to discriminate among a few groups of authors while also
presenting a wide range of symmetry values. The patterns obtained for the word “time” are
arranged along the top of the corresponding axis according to their respective merged
symmetry, which was found to separate Arthur Conan Doyle, Thomas Hardy and Charles Darwin.
Note that the nodes with low merged symmetry presented several edges crossing over the
internal shell of their patterns. Additionally, connections between nodes lying at the
third concentric level are much less organized, hence the low values of symmetry.
Conversely, nodes taking high values of symmetry displayed more organized connections,
leading to higher uniformity of connections among nodes lying at the farthest concentric
level. The same observations can be made for the patterns obtained for the word “indeed”,
which discriminated between Hector Hugh Munro and the group of authors encompassing Arthur
Conan Doyle, Bram Stoker, Thomas Hardy and Charles Dickens. These patterns, however, are
much more symmetric, which is once again reflected in the visualizations by their higher
organization on the last concentric level.

Conclusion. — In this paper, we have introduced the 
concept of symmetry to study the connectivity patterns of word adjacency networks. By
defining symmetry as a function of particular random walks, we showed that the symmetry
measurements are robust in the sense that they do not mimic the behavior of other
traditional topological measurements. Thus, because symmetry measurements do not strongly
correlate with other traditional network or textual features, they could be combined with
other measurements to improve the characterization of texts represented as graphs and
related networked systems. The proposed symmetry measurements were also evaluated in the
context of the authorship recognition task. The results revealed that the symmetry of
specific words is able to identify the authorship of books with high accuracy rates. This
result confirms the suitability of the measurements to detect the subtleties of authors’
styles reflected in the organization of word adjacency networks. In future work, we intend
to study the suitability of both backbone and merged symmetry in semantic networks, which
may ultimately lead to the improvement of several semantics-related applications.